GVPT Maths Camp

Data Wrangling

Learning objectives

  1. Import your data

  2. Clean your data

  3. Explore relational data

Data wrangling

Source: R4DS

World Bank GDP data

https://data.worldbank.org/indicator/NY.GDP.MKTP.CD

  • Official source of global and historical GDP data

  • A very common control variable for IR and CP analysis

  • Very frustratingly messy!

Download global GDP data

Source: World Bank

Working with external data

Introducing here::here()

install.packages("here")


Points to where you are on your computer. Updates for everyone on any computer!

here::here()
[1] "/Users/harrietgoers/Documents/intro_to_r_ps"


EXERCISE: See where you are!

Reading in your csv

  • I like to store raw data in a folder called data-raw within my project.

  • I then store any clean data that is ready for analysis in a data folder within my project.

You will start to see this structure throughout my scripts. You should come up with something that works for you. It should be intuitive. This will help future you and any other people who want to explore your work navigate your projects.

Reading in your csv

library(tidyverse)

gdp_raw <- read_csv(here::here("slides", "data-raw", "wb_gdp.csv"))

head(gdp_raw)
# A tibble: 6 × 3
  `Data Source`               `World Development Indicators` ...3               
  <chr>                       <chr>                          <chr>              
1 Last Updated Date           2023-03-01                      <NA>              
2 Country Name                Country Code                   "Indicator Name,In…
3 Aruba                       ABW                            "GDP (current US$)…
4 Africa Eastern and Southern AFE                            "GDP (current US$)…
5 Afghanistan                 AFG                            "GDP (current US$)…
6 Africa Western and Central  AFW                            "GDP (current US$)…


EXERCISES

  1. Read the ?read_csv help file. What arguments does this function take?

  2. Head to the readr package documentation and find what other file types you can read in.

Skipping non-relevant rows

Source: World Bank

Skipping non-relevant rows

gdp_raw <- read_csv(
  here::here("slides", "data-raw", "wb_gdp.csv"), skip = 4, col_select = 1:66
)

head(gdp_raw)
# A tibble: 6 × 66
  `Country Name`       `Country Code` `Indicator Name` `Indicator Code`   `1960`
  <chr>                <chr>          <chr>            <chr>               <dbl>
1 Aruba                ABW            GDP (current US… NY.GDP.MKTP.CD   NA      
2 Africa Eastern and … AFE            GDP (current US… NY.GDP.MKTP.CD    2.13e10
3 Afghanistan          AFG            GDP (current US… NY.GDP.MKTP.CD    5.38e 8
4 Africa Western and … AFW            GDP (current US… NY.GDP.MKTP.CD    1.04e10
5 Angola               AGO            GDP (current US… NY.GDP.MKTP.CD   NA      
6 Albania              ALB            GDP (current US… NY.GDP.MKTP.CD   NA      
# ℹ 61 more variables: `1961` <dbl>, `1962` <dbl>, `1963` <dbl>, `1964` <dbl>,
#   `1965` <dbl>, `1966` <dbl>, `1967` <dbl>, `1968` <dbl>, `1969` <dbl>,
#   `1970` <dbl>, `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>,
#   `1975` <dbl>, `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>,
#   `1980` <dbl>, `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>,
#   `1985` <dbl>, `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>,
#   `1990` <dbl>, `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, …

EXERCISES

Taking a look at this data set:

skimr::skim(gdp_raw)


What types of data do we have? Are they the right type of data?


Are we missing data points?

Tidy Data Structures

Source: R4DS

Tidying World Bank data

What do you want to do with your data?

  • I want to analyse country, regional, and global trends in GDP over time

What I need:

  • Annual data on each country’s GDP

  • The region to which each country belongs

Tidying World Bank data

To do:

  • Move the yearly data from columns to rows

  • Clean up these column names so that they are easier to use in R

  • Add regional data

Pivoting your data

gdp_df <- pivot_longer(
  data = gdp_raw, 
  cols = `1960`:`2021`,
  names_to = "year",
  values_to = "gdp"
)


colnames(gdp_raw)
 [1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
 [5] "1960"           "1961"           "1962"           "1963"          
 [9] "1964"           "1965"           "1966"           "1967"          
[13] "1968"           "1969"           "1970"           "1971"          
[17] "1972"           "1973"           "1974"           "1975"          
[21] "1976"           "1977"           "1978"           "1979"          
[25] "1980"           "1981"           "1982"           "1983"          
[29] "1984"           "1985"           "1986"           "1987"          
[33] "1988"           "1989"           "1990"           "1991"          
[37] "1992"           "1993"           "1994"           "1995"          
[41] "1996"           "1997"           "1998"           "1999"          
[45] "2000"           "2001"           "2002"           "2003"          
[49] "2004"           "2005"           "2006"           "2007"          
[53] "2008"           "2009"           "2010"           "2011"          
[57] "2012"           "2013"           "2014"           "2015"          
[61] "2016"           "2017"           "2018"           "2019"          
[65] "2020"           "2021"          
colnames(gdp_df)
[1] "Country Name"   "Country Code"   "Indicator Name" "Indicator Code"
[5] "year"           "gdp"           

Clean column names

Column names should not:

  • Have spaces

  • Start with numbers

Introducing janitor:

# Install the `janitor` package

install.packages("janitor")


gdp_df <- janitor::clean_names(gdp_df)

colnames(gdp_df)
[1] "country_name"   "country_code"   "indicator_name" "indicator_code"
[5] "year"           "gdp"           

Add region data

Introducing the countrycode package - the indispensable workhorse of country data:

# Install the `countrycode` package

install.packages("countrycode")

Add region data

library(countrycode)

# Add each country's World Bank region to the data set

gdp_df <- gdp_df |> 
  mutate(
    region = countrycode(country_name, 
                         "country.name", 
                         "region", 
                         custom_match = c("Turkiye" = "Europe & Central Asia"))
  ) |> 
  # Remove observations that are regions
  drop_na(region) |> 
  relocate(region, .after = "country_code")

Add region data

gdp_df |> 
  distinct(country_name, region) |> 
  head(10)
# A tibble: 10 × 2
   country_name         region                    
   <chr>                <chr>                     
 1 Aruba                Latin America & Caribbean 
 2 Afghanistan          South Asia                
 3 Angola               Sub-Saharan Africa        
 4 Albania              Europe & Central Asia     
 5 Andorra              Europe & Central Asia     
 6 United Arab Emirates Middle East & North Africa
 7 Argentina            Latin America & Caribbean 
 8 Armenia              Europe & Central Asia     
 9 American Samoa       East Asia & Pacific       
10 Antigua and Barbuda  Latin America & Caribbean 

Make sure all data are the right type

glimpse(gdp_df)
Rows: 13,454
Columns: 7
$ country_name   <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "…
$ country_code   <chr> "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW", "ABW",…
$ region         <chr> "Latin America & Caribbean", "Latin America & Caribbean…
$ indicator_name <chr> "GDP (current US$)", "GDP (current US$)", "GDP (current…
$ indicator_code <chr> "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "NY.GDP.MKTP.CD", "…
$ year           <chr> "1960", "1961", "1962", "1963", "1964", "1965", "1966",…
$ gdp            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

Make sure all data are the right type

gdp_df <- transmute(
  gdp_df,
  country_name, 
  region, 
  year = as.integer(year),
  gdp
)

glimpse(gdp_df)
Rows: 13,454
Columns: 4
$ country_name <chr> "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Aruba", "Ar…
$ region       <chr> "Latin America & Caribbean", "Latin America & Caribbean",…
$ year         <int> 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 196…
$ gdp          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

How do countries compare over time?

ggplot(gdp_df, aes(
  x = year, y = gdp, colour = region, group = country_name
)) + 
  geom_line() + 
  theme_minimal()

How do regions compare over time?

By default, R will carry forward NAs. This is good!

gdp_df |> 
  group_by(region, year) |> 
  summarise(avg_gdp = mean(gdp)) |> 
  ggplot(aes(x = year, y = avg_gdp, colour = region)) + 
  geom_line() + 
  theme_minimal()

Dealing with missing data

gdp_df |> 
  group_by(region, year) |> 
  summarise(avg_gdp = mean(gdp, na.rm = T)) |> 
  ggplot(aes(x = year, y = avg_gdp, colour = region)) + 
  geom_line() + 
  theme_minimal()

Dealing with missing data

Summary

Today you:

  • Learnt how to read in external data

  • Learnt how to clean up common problems using R